Today we will be getting familiar with two common environments that bioinformaticians commonly operate in - Bash and R (RStudio). We are forunate to have access to a powerful linux server that we can use to execute all our code. First up we are going to try and access this server from the command line and then afterwards, we’ll start getting familiar with RStudio.
To access the server, we will use the SSH (Secure SHell) protocol which is a secure way to access a server from your own computer. The way that we will do this depends on the operating system running on your computer.
For Linux and Mac users, open up the Terminal app.
For PC users, you may have to first download an ssh client such as Putty however newer versions of Windows have a built in ssh client called OpenSSH.
For Mac users, there are other more sophisticated terminal option you can try out if you like. The one I use is is called iTerm2. Linux also has plenty of options but I like Terminator. In both of these you can split your window vertically or horizontally to have more than one terminal window open at the same time.
Once you open your terminal, you can access the server by typing the following:
ssh [username]@bioinformatics.nec-mf-proj01.cloud.edu.au/
where [username] is your username in the format of your first name followed by your last initial with no spaces. So for me, Andrew Calcino, my username would be andrewc. Once you hit enter, you’ll be asked for a password. This will be XXX. Once you are in, we can start to explore the server and to learn some basic tools for navigating in a Linux environment.
Let’s move into that files directory located within our home directory and have a look at its contents.
test_user_1@bioinfo:~$ cd ~/shared_data/tut_1
test_user_3@bioinfo:~/shared-data/tut_1$ ls
example.bam fastq_example.fq multi_fasta.fa single_fasta.fa
The ~ is a special character that just means home
directory. You can either use this with cd or you can
use a relative or absolute path to get to the files directory. In here
you can see three files. Print the contents of
single_fasta.fa to the screen using a command you’ve
already learnt. hint it’s not dog.
This is a fasta file and its a very simple format for storing
nucleotide of amino acid sequences. It consists of a header line
starting with a > followed by a description of the
sequence, and a second line containing this sequence. In this case the
sequence is wrapped to keep the line lengths limited. Now have a look at
the other fasta but use the command less to do so. When
you’re done looking at the sequence, hit the q key.
less is a good way to look at very large files that you
don’t want to print to the screen. You can scroll up and down with the
arrow keys or use the space bar to move down a page at a
time.
This file is like the first except that it contains two fasta
sequences which is why it is called a multi fasta. Let’s
have a look at fastq_example.fq choosing whatever method you like. This
one is slightly more complicated. It’s called fastq format
and in addition to having some nucleotide sequence, it also some strange
characters that actually represent quality scores. Fastq format files
are typically what you get straight off of sequencing machines and they
describe how confident it was in calling each nucleotide in the
sequence. Each entry in a fastq file covers four lines - a header line
starting with a @, the sequence line, the plus line which
is a vestigal remnant of when fastqs were first invented and largely
serves no purpose anymore, and a quality scores line.
The quality score line (line 4) encodes the confidence in each base
call using ASCII characters. Each character in the quality line
corresponds to one base in the sequence line and represents a
Phred quality score encoded as an ASCII character. Phred
scores represent the probability that a base call is correct with higher
Q scores corresponding to higher confidence calls. This is the formular
used to caluculate phred scores:
Phred score = -10 × log₁₀(error probability)
so for example, if our error probability was 10%, we would get a phred score of:
Q = -10 x log₁₀(0.1) = 10
That means:
Q10 = 1 error every 10 bases or 90% accuracy
Q20 = 1 error every 100 bases or 99% accuracy
Q30 = 1 error every 1000 bases or 99.9% accuracy
Q40 = 1 error every 10000 bases of 99.99% accuracy
The next file we are going to check out is example.bam.
Try and have a look at the contents of this using your favourite
method.
Surprise! This one isn’t a text file, rather bam files are in a human
unreadable binary format. To view this type of file we need to use a
package called samtools which has been preinstalled for
your convenience on this server.
test_user_1@bioinfo:~/shared-data/tut_1$ samtools view -h example.bam
@HD VN:1.4 SO:coordinate
@SQ SN:1 LN:248956422
@SQ SN:2 LN:242193529
@SQ SN:3 LN:198295559
@SQ SN:4 LN:190214555
@SQ SN:5 LN:181538259
@SQ SN:6 LN:170805979
@SQ SN:7 LN:159345973
@SQ SN:8 LN:145138636
@SQ SN:9 LN:138394717
@SQ SN:10 LN:133797422
@SQ SN:11 LN:135086622
@SQ SN:12 LN:133275309
@SQ SN:13 LN:114364328
@SQ SN:14 LN:107043718
@SQ SN:15 LN:101991189
@SQ SN:16 LN:90338345
@SQ SN:17 LN:83257441
@SQ SN:18 LN:80373285
@SQ SN:19 LN:58617616
@SQ SN:20 LN:64444167
@SQ SN:21 LN:46709983
@SQ SN:22 LN:50818468
@SQ SN:X LN:156040895
@SQ SN:Y LN:57227415
@SQ SN:MT LN:16569
@SQ SN:NC_001422.1 LN:5386
@SQ SN:NC_007605.1 LN:171823
@PG ID:STAR PN:STAR VN:2.7.1a CL:/g/data/pq84/software/STAR/bin/Linux_x86_64_static/STAR --runThreadN 16 --genomeDir /g/data/pq84/software/STAR/GRCh38/150bp --readFilesIn /g/data/pq84/rnaseq/AAA_3_projects/Allopurinol/JCU1701_R1_trimmed.fastq /g/data/pq84/rnaseq/AAA_3_projects/Allopurinol/JCU1701_R2_trimmed.fastq --outFileNamePrefix /g/data/pq84/rnaseq/AAA_3_projects/Allopurinol/JCU1701_ --outMultimapperOrder Random --outSAMtype BAM SortedByCoordinate --outSAMattrRGline ID:JCU1701 SM:JCU1701_l1 PL:ILLUMINA --outFilterMultimapNmax 51 --alignSJDBoverhangMin 10000 --sjdbOverhang 149 --quantMode GeneCounts
@PG ID:samtools PN:samtools PP:STAR VN:1.22 CL:samtools view -s 0.0000001 -b JCU1701_Aligned.sortedByCoord.out.bam
@PG ID:samtools.1 PN:samtools PP:samtools VN:1.22.1 CL:samtools view -h example.bam
@RG ID:JCU1701 SM:JCU1701_l1 PL:ILLUMINA
@CO user command line: /g/data/pq84/software/STAR/bin/Linux_x86_64_static/STAR --outSAMattrRGline ID:JCU1701 SM:JCU1701_l1 PL:ILLUMINA --sjdbOverhang 149 --runThreadN 16 --genomeDir /g/data/pq84/software/STAR/GRCh38/150bp --outFileNamePrefix /g/data/pq84/rnaseq/AAA_3_projects/Allopurinol/JCU1701_ --outSAMtype BAM SortedByCoordinate --quantMode GeneCounts --outFilterMultimapNmax 51 --outMultimapperOrder Random --alignSJDBoverhangMin 10000 --readFilesIn /g/data/pq84/rnaseq/AAA_3_projects/Allopurinol/JCU1701_R1_trimmed.fastq /g/data/pq84/rnaseq/AAA_3_projects/Allopurinol/JCU1701_R2_trimmed.fastq
LH00196:47:22CNY5LT3:6:2135:6659:6832 163 1 16397254 255 100M = 16397297 144 GGGCTTTGCCCGCTGTGTGCCTGTCACCACCCACCAGCAGTCATGCCCTGGGCTTCCCAAATGGAGAGGTAGCAGGCAACGTTTTTAAAAAGAAAGAAAA FFFFFFFFFFFFFFFFFF5FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:199 nM:i:0 RG:Z:JCU1701
LH00196:47:22CNY5LT3:6:2135:6659:6832 83 1 16397297 255 101M = 16397254 -144 TGCCCTGGGCTTCCCAAATGGAGAGGTAGCAGGCAACGTTTTTAAAAAGAAAGAAAACAGGAAACTGTATTGTGTCGGGGGAGGCGGGAGGGAGATGAGGA FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF555FFFFFFFFFFFFFFF55FF5FFF-F NH:i:1 HI:i:1 AS:i:199 nM:i:0 RG:Z:JCU1701
LH00196:47:22CNY5LT3:6:1108:20038:1048 99 1 211746504 255 2S99M = 211746661 258 TNCTAGAAAAAAAATGTCTGATGAGAGTAAGAATTCAAACTCCTAAATAAAAGTAATTTTTGATTTACCTATCATTTATTACCAGATTATCTATCCAGAAA F!FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:198 nM:i:0 RG:Z:JCU1701
LH00196:47:22CNY5LT3:6:1108:20038:1048 147 1 211746661 255 101M = 211746504 -258 AAAGACCAATAAGATAGATCTGAACTCATAGCTATCTTCTTTTTTAATATACTTTTGGAATTCTCAGTGAATTTGTTTGCTTGCCCAAAATATTATTTTCC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:198 nM:i:0 RG:Z:JCU1701
LH00196:47:22CNY5LT3:5:1222:3857:10597 99 2 128312738 255 101M = 128312890 253 CGGTGATTTAAGAGAATAAATTTAAAAAAAAATTCACAATAGGCTGGTGCGGTGGCTCACACCTGTCATCCCAGCACTTTGGCAGGCGAAGTCGGGCAGAT FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:200 nM:i:0 RG:Z:JCU1701
Right, so what is all this nonsense? Bam format is more complicated again as it includes positional information on where each DNA sequence has been mapped to in the genome. The first part of a bam file is called the header and it contains a list of all the chromosomes in the genome this file is mapped to and their lengths, follwed by a bunch of information on how the file was created. The bulk of the file consists of tab delimited entries that describe a lot of information about the sequence, where its mapped to and how confident you can be about the mapping. For a full description of the bam format check this out but be warned, it might be tough to follow at this sstage of the course. As we progress through the next ten weeks, hopefully some of what they are talking about in there begins to make more sense to you. Just a note - SAM format is the same as BAM but in human readable format.
If you just want to have a look at a single entry, try this:
test_user_1@bioinfo:~/shared-data/tut_1$ samtools view example.bam | head -1
LH00196:47:22CNY5LT3:6:2135:6659:6832 163 1 16397254 255 100M = 16397297 144 GGGCTTTGCCCGCTGTGTGCCTGTCACCACCCACCAGCAGTCATGCCCTGGGCTTCCCAAATGGAGAGGTAGCAGGCAACGTTTTTAAAAAGAAAGAAAA FFFFFFFFFFFFFFFFFF5FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:199 nM:i:0 RG:Z:JCU1701
There are three parts to this command that should be understood.
First up is the samtools view command that we’ve already
seen but this time we didn’t use the -h option and so it
skipped the header section. Next up in a | symbol, known as
a pipe. This takes the output of one command and pipes
it through to be used as the input for a subsequent command. In this
case, that command is head -1. head by itself
is another way to view a file. Without any options, it prints the first
ten lines of a file while head -1 prints just the first
line.
Just imagine that we are only interested in the start positions of
the mapped reads. As this is a tab delimited file, we can use
cut to extract the columns we are interested in. In this
case, we are interersted in column 4.
test_user_1@bioinfo:~/shared-data/tut_1$ samtools view example.bam | cut -f4
16397254
16397297
211746504
211746661
128312738
128312890
42990216
42990372
61628270
61628328
73113661
73113907
If this file was comma delimited, we could have to tell the
cut command to use a comma as the delimiter instead of the
default tab. Let’s create a comma delimited version of this SAM file and
then use cut to extract that same fourth column from it. To replace all
the tabs with commas, we can use the command sed which has
it’s own sophisticated syntax.
test_user_1@bioinfo:~/shared-data/tut_1$ samtools view example.bam | sed 's/\t/\,/g' | head -1
LH00196:47:22CNY5LT3:6:2135:6659:6832,163,1,16397254,255,100M,=,16397297,144,GGGCTTTGCCCGCTGTGTGCCTGTCACCACCCACCAGCAGTCATGCCCTGGGCTTCCCAAATGGAGAGGTAGCAGGCAACGTTTTTAAAAAGAAAGAAAA,FFFFFFFFFFFFFFFFFF5FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,NH:i:1,HI:i:1,AS:i:199,nM:i:0,RG:Z:JCU1701
Let’s break down this sed command. The first
s tells sed to conduct a replace function and what we want
to do is replace all the tabs \t with commas
\,. The g at the end is a global flag and
tells it to do the replace for all tabs, not just the first one it sees.
sed is very powerful and I recommend investigating
everything it can be used for at some stage.
If you understand all that, then it’s time for a small challenge.
Take the previous command and send the output to a new file called
example.csv and then use the cut command to
print the fourth column like we did with the original bam file. If you
can’t remember how to write the output to a file, go right back to the
start of this text when we created hello.txt and also
remember to talk to your neighbours to work it out!
Now what if we just wanted to extract reads that mapped to eight
locations in the genome? For the sake of this exercise, don’t worry that
you don’t know what that means in this context, just know that you
should be trying to extract reads that contain the NH:i:8
flag in column 12. To do this, we can use linux’s find command
grep.
test_user_1@bioinfo:~/shared-data/tut_1$ samtools view example.bam | grep 'NH:i:8'
LH00196:47:22CNY5LT3:8:2154:42108:21028 163 21 8210701 0 58M = 8210762 161 GTTCCGACCATAAACGATGCCGACCGGCGATGCGGCGGCGTTATTCCCATGACCCGCC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:8HI:i:1 AS:i:156 nM:i:0 RG:Z:JCU1701
LH00196:47:22CNY5LT3:8:2154:42108:21028 83 21 8210762 0 100M = 8210701 -161 CAGCTTCCGGGAAACCAAAGTCTTTGGGTTCCGGGGGGAGTATGGTTGCAAAGCTGAAACTTAAAGGAATTGACGGAAGGGCACCACCAGGAGTGGAGCC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:8 HI:i:1 AS:i:156 nM:i:0 RG:Z:JCU1701
LH00196:47:22CNY5LT3:8:2154:42108:21028 419 21 8254909 0 58M = 8254970 161 GTTCCGACCATAAACGATGCCGACCGGCGATGCGGCGGCGTTATTCCCATGACCCGCC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:8HI:i:3 AS:i:156 nM:i:0 RG:Z:JCU1701
LH00196:47:22CNY5LT3:8:2154:42108:21028 339 21 8254970 0 100M = 8254909 -161 CAGCTTCCGGGAAACCAAAGTCTTTGGGTTCCGGGGGGAGTATGGTTGCAAAGCTGAAACTTAAAGGAATTGACGGAAGGGCACCACCAGGAGTGGAGCC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:8 HI:i:3 AS:i:156 nM:i:0 RG:Z:JCU1701
LH00196:47:22CNY5LT3:8:2154:42108:21028 419 21 8393736 0 58M = 8393797 161 GTTCCGACCATAAACGATGCCGACCGGCGATGCGGCGGCGTTATTCCCATGACCCGCC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:8HI:i:8 AS:i:156 nM:i:0 RG:Z:JCU1701
LH00196:47:22CNY5LT3:8:2154:42108:21028 339 21 8393797 0 100M = 8393736 -161 CAGCTTCCGGGAAACCAAAGTCTTTGGGTTCCGGGGGGAGTATGGTTGCAAAGCTGAAACTTAAAGGAATTGACGGAAGGGCACCACCAGGAGTGGAGCC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:8 HI:i:8 AS:i:156 nM:i:0 RG:Z:JCU1701
LH00196:47:22CNY5LT3:8:2154:42108:21028 419 21 8437946 0 58M = 8438007 161 GTTCCGACCATAAACGATGCCGACCGGCGATGCGGCGGCGTTATTCCCATGACCCGCC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:8HI:i:6 AS:i:156 nM:i:0 RG:Z:JCU1701
LH00196:47:22CNY5LT3:8:2154:42108:21028 339 21 8438007 0 100M = 8437946 -161 CAGCTTCCGGGAAACCAAAGTCTTTGGGTTCCGGGGGGAGTATGGTTGCAAAGCTGAAACTTAAAGGAATTGACGGAAGGGCACCACCAGGAGTGGAGCC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:8 HI:i:6 AS:i:156 nM:i:0 RG:Z:JCU1701
LH00196:47:22CNY5LT3:8:2154:42108:21028 419 22_KI270733v1_random 127001 0 58M = 127062 161 GTTCCGACCATAAACGATGCCGACCGGCGATGCGGCGGCGTTATTCCCATGACCCGCC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:8 HI:i:4 AS:i:156 nM:i:0 RG:Z:JCU1701
LH00196:47:22CNY5LT3:8:2154:42108:21028 339 22_KI270733v1_random 127062 0 100M = 127001 -161 CAGCTTCCGGGAAACCAAAGTCTTTGGGTTCCGGGGGGAGTATGGTTGCAAAGCTGAAACTTAAAGGAATTGACGGAAGGGCACCACCAGGAGTGGAGCC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:8 HI:i:4 AS:i:156 nM:i:0 RG:Z:JCU1701
LH00196:47:22CNY5LT3:8:2154:42108:21028 419 22_KI270733v1_random 172082 0 58M = 172143 161 GTTCCGACCATAAACGATGCCGACCGGCGATGCGGCGGCGTTATTCCCATGACCCGCC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:8 HI:i:5 AS:i:156 nM:i:0 RG:Z:JCU1701
LH00196:47:22CNY5LT3:8:2154:42108:21028 339 22_KI270733v1_random 172143 0 100M = 172082 -161 CAGCTTCCGGGAAACCAAAGTCTTTGGGTTCCGGGGGGAGTATGGTTGCAAAGCTGAAACTTAAAGGAATTGACGGAAGGGCACCACCAGGAGTGGAGCC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:8 HI:i:5 AS:i:156 nM:i:0 RG:Z:JCU1701
LH00196:47:22CNY5LT3:8:2154:42108:21028 419 Un_GL000220v1 110148 0 58M = 110209 161 GTTCCGACCATAAACGATGCCGACCGGCGATGCGGCGGCGTTATTCCCATGACCCGCC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:8 HI:i:2 AS:i:156 nM:i:0 RG:Z:JCU1701
LH00196:47:22CNY5LT3:8:2154:42108:21028 339 Un_GL000220v1 110209 0 100M = 110148 -161 CAGCTTCCGGGAAACCAAAGTCTTTGGGTTCCGGGGGGAGTATGGTTGCAAAGCTGAAACTTAAAGGAATTGACGGAAGGGCACCACCAGGAGTGGAGCC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:8 HI:i:2 AS:i:156 nM:i:0RG:Z:JCU1701
LH00196:47:22CNY5LT3:8:2154:42108:21028 419 Un_GL000220v1 154120 0 58M = 154181 161 GTTCCGACCATAAACGATGCCGACCGGCGATGCGGCGGCGTTATTCCCATGACCCGCC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:8 HI:i:7 AS:i:156 nM:i:0 RG:Z:JCU1701
LH00196:47:22CNY5LT3:8:2154:42108:21028 339 Un_GL000220v1 154181 0 100M = 154120 -161 CAGCTTCCGGGAAACCAAAGTCTTTGGGTTCCGGGGGGAGTATGGTTGCAAAGCTGAAACTTAAAGGAATTGACGGAAGGGCACCACCAGGAGTGGAGCC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:8 HI:i:7 AS:i:156 nM:i:0RG:Z:JCU170
This takes the output of samtools view, pipes it to
grep and then prints every line that contains the requried
flag. You can use this to extract lines that contain whatever it is you
like.
Next I want to explain for loops. This is a function
that repeats a set of commands on a list of inputs. Consider the
following for loop that executes a series of commands on every file in
the current folder that ends in .fa.
test_user_1@bioinfo:~/shared-data/tut_1$ for i in *fa ; do
grep '>' $i ;
done
>sequence_1
>sequence_2
>sequence_1
Let’s break this down. First up, the for loop introduces the variable
i and it assigns that variable to every file in the current
directory that ends in fa, taking note that the use of the
wildcard * means that we don’t care what the file is
called, as long as the last two letters of that file name are f and a.
So for every file that ends in fa, it then does a search of its contents
to extract every line that contains a ‘>’ character. The
$i denotes the variable i introduced at the
beginning. If you remember from when we introduced the fasta format, the
> character denotes the header line of each entry. This
probably isn’t the greatest example of a for loop and the same sort of
output can be obtained in other ways (see below), but if you get used to
the syntax of for loops, you’ll find yourself using them over and over
again.
test_user_1@bioinfo:~/shared-data/tut_1$ grep '>' multi_fasta.fa single_fasta.fa
multi_fasta.fa:>sequence_1
multi_fasta.fa:>sequence_2
single_fasta.fa:>sequence_1
test_user_1@bioinfo:~/shared-data/tut_1$ grep '>' *fa
multi_fasta.fa:>sequence_1
multi_fasta.fa:>sequence_2
single_fasta.fa:>sequence_1
The last thing we are going to try in bash is to create a small
executable script. Up until this point, we’ve just been using the
interactive bash shell but we can also use bash non-interactively. First
up, navigate to your ~/working-directory directory, create
a new folder here called bash_scripts and then move to this
new location.
To write a script, we need a text editor and bash has a few options
built in. I like to use vim but other options include
emacs and nano. I’m going to use
vim here but feel free to use on of the others if you are
familiar with them.
First up, type in vim my_first_script.sh into the
command line and then hit enter. Remember to use a
_ instead of a space, otherwise it will treat
my, first and script.sh as three
separate strings. This brings up a blank screen which you can think of
as something akin to document in Word. Vim has a syntax completely
different to anything we’ve seen so far so if you choose to use it, I
encourage you to get familiar with its idiosyncrasies. A good place to
start is here.
To start editing the script, hit the i key. This allows
you to start editing. The first line of a bash script is called the
shebang. This tells the system to use the bash interpreter.
The shebang should look like this:
#!/bin/sh
Next up, our script needs some contents. My example below is simply
going to print a string to the screen upon execution but I want you to
try something more complex. For example, maybe you could get your script
to list all the files in the current folder that are either fastas or
fastqs. Maybe you could get it to list all the files in the folder that
were created today hint you might need to use the output of
ll for this. Maybe your script could extract every line in
all the fastqs in your current folder that start with an @
symbol. It’s really up to you so try a few things out to see what can be
done with the tools we’ve learnt today. This is my extremely boring
script.
#!/bin/sh
echo "Hello World!"
Once you are done, click esc to get stop editing the doc
and then type :wq and then enter. The w saves
the document and the q quits vim. Now to execute your
script, type:
sh ./my_first_script.sh
and then enter. Another option is to make your script
executable by changing the permission. This way you can avoid the
sh command when executing it. To do this type
chmod +x my_first_script.sh . Now all you need to do to
execute it it to type:
./my_first_script.sh
If this all works well, your script should execute properly and you should see the output printed to the screen but be aware of the folder that you are in and the location of your script as the two may differ.
This has been just a brief introduction to Bash scripting but there is an enormous amount more that can be done here that we don’t have time to cover today. Next up we are going to look at RStudio and see how we can navigate around the same filesystem as we just did with bash.
RStudio is an IDE (Integrated Development Environment) for the R programming language. You can access our course RStudio environment here.
Your RStudio window is most likely broken up into four panes. The top left is typically where you have a script that you are working on, the top right is your Envrionment which lists all loaded objects, the bottom right shows your filesystem and the bottom left is the console. You can see that each of these has multiple tabs, some of which we will explore today.
First up, let’s look at the bottom right pane. You should recognise
the strucure of your home directory from all the work we’ve been doing
in Bash. If you click on the scripts folder, you should see
your my_first_script.sh that you just created.
Let’s have a look at the console now. R has a concept of the
working directory which is kind of like the address from
which all relative paths stem. To see the current directory, use
getwd()
> getwd()
[1] "/home/andrewc"
ls and ll also have equivalents in R:
> list.files()
[1] "files" "R" "setup_bats.sh" "Test"
>
> file.info(list.files())
size isdir mode mtime ctime atime uid gid uname grname
files 4096 TRUE 755 2025-07-31 00:34:56 2025-07-31 00:34:56 2025-07-31 00:35:01 1006 1007 andrewc andrewc
R 4096 TRUE 755 2025-07-30 00:50:19 2025-07-30 00:50:19 2025-07-30 00:52:39 1006 1007 andrewc andrewc
setup_bats.sh 129 FALSE 644 2024-04-04 07:02:39 2024-07-21 06:54:33 2025-07-30 00:16:15 0 0 root root
Test 4096 TRUE 755 2025-07-30 03:11:34 2025-07-30 03:11:34 2025-07-30 03:11:35 1006 1007 andrewc andrewc
If we really want to, can run non R scripts through R by using the
system() command:
> system("scripts/hello.sh")
Hello World!
R can be used both interactively and non-interactively, just as can
be done with bash. To make your first R script, create a folder in the
~/working-directory directory called r_scripts
using the New Folder button in the bottom right pane and
then use the top left pane to edit and save the script to the correct
location.
In the image above you can see that I have created a script called
my_first_r_script.R and saved it in my
home/r_scripts folder. The script (top left) can be
executed within the console (bottom right) using the
source() command or you can execute it by clicking the
run button in the top left pane.
For a usecase as simple as this though, it probably makes more sense
to run the script interactively. To do this, simply type the command
directly into the console and then hit enter.
> print("Hello World!")
[1] "Hello World!"
R has a lot of advantages over Bash, one being that it can handle
more complicated data structures that can bundle together data and
functions into objects. It also has an enormous and ever
growing repository of libraries which enable complex functions so that
you don’t need to code every single function every single time you use
it.
Let’s have a look at an object called a data frame that
would be very difficult to replicate in Bash.
my_df <- data.frame(
name = c("Alice", "Bob", "Mary"), # character
age = c(25, 30, 27), # numeric
married = c(TRUE, FALSE, FALSE) # logical
)
This creates the my_df object which will be visible in your
environment pane. This example data frame combines three types of data -
characters, numbers and a logical TRUE/FALSE data type. A related type
of object called a matrix is very similar to a
data frame but differs because it only contains a single
type of data, eg. numerical. We can now query this using a bunch of
built in functions. Let’s try out a few.
> # Mean age of people
> mean(my_df$age)
[1] 27.33333
> # Print all names
> my_df$name
[1] "Alice" "Bob" "Mary"
> # Print a summary of all data in the df
> summary(my_df)
name age married
Length:3 Min. :25.00 Mode :logical
Class :character 1st Qu.:26.00 FALSE:2
Mode :character Median :27.00 TRUE :1
Mean :27.33
3rd Qu.:28.50
Max. :30.00
> # Plot all ages (view results in bottom right pane)
> plot(my_df$age)
> # Identify older unmarried people and subset them
> older_umarried <- my_df[my_df$age > 25 & my_df$married == FALSE, ]
> older_umarried
name age married
2 Bob 30 FALSE
3 Mary 27 FALSE
In a data frame, the $ symbol is uesed to refer to a
particular column. In the first example above, we wanted to determine
the mean of all ages so we used the mean() function on
my_df$age. Try do the same on $name or
$married to see what happens. In the fourth example, we
produced a simple scatter plot of the age data using the
plot() function. Try the same for $married and
$name to see what happens. If you want to investigate all
the option available to the plot() function, type `?plot’
and have a read in the bottom right hand pane.
In the final example we created a new data frame object called
older_unmarried by subsetting my_df. To do
this, we first indicate that we want to subset the object
my_df by defining it with my_df[]. The code
inside the square brackets describes how we would like to subset it. In
our case we are setting two conditions to define our subset and these
are linked with a & symbol. Read this line carefully to
make sure you understand it.
Have a play around with this object to create your own subsets. Maybe create a new, larger data frame and try out other ways to create subsets.
Next up we’re going to look at functions. Just say
there’s an operation that we are going to need to use over and over
again, rather than code this function every time we need it, we have the
option of writing it up once as a function and then saving it as a
.R file to be reused whenever we like in future. Ususally
you’d do this for reasonably complex operations with lots of steps but
for demonstrative purposes here, we are going to code up a very simple
function that does a single thing - lists all the fasta files in a given
directory. This is how we define our function:
> list_fasta_files <- function(directory = ".") {
}
If we run this code chunk, we will produce a function called
list_fasta_files and this will be listed in our environment pane, but it
will not do anything yet because we havene’t filled in the bit between
the {} yet. What we have defined so far is the name of the
function list_fasta_files, the fact that
list_fasta_files is a function and that our function will
operate on variable that we have called directory. We have
also defined directory to have the default value of
. which means our working directory. Our choice of calling
the variable directory was purely our choice that we
selected so that anyone reading the code would understand that this
function should operate on directories, but if we wanted to, we could
have chosen squid or swedish_death_metal or
oompaloompa and the code would work perfectly fine, however
it would be less readable by someone else or by yourself in the
future once you’ve kind of forgotten what this whole project was
about.
So let’s fill in the {}.
list_fasta_files <- function(directory = ".") {
# Get all files in specified directory with .fasta, .fa or .fas extensions
fasta_files <- list.files(path = directory,
pattern = "\\.(fasta|fa|fas)$",
ignore.case = TRUE)
# Return the list of files
return(fasta_files)
}
If it’s not clear already, any line that starts a # is
called a comment. These lines have no bearing on the
execution of the code but they are included to help anyone reading the
code to understand what a code chunk does or why its there. You can NOT
use too many comments in your code. Good commenting is extremely
important to make your code more accessible to others and to your future
self.
Alright, back to the function, have a read through of what we just
did to make sure you undersand it. Our function is going to create an
object called fasta_files that is a list of
all the files that exist in the directory defined by the variable
directory that have a suffix of either .fasta,
.fa or .fas - three common ways to name fasta
files. The \\ is an escape syntax to define a literal
.. Our function also lists files whether the names are
capitalised or not. After this, it will print out the contents of the
fasta_files list. Now let’s execute it, first using the
default directory and then afterwards by directing it to look in our
files directory.
> list_fasta_files("files")
[1] "multi_fasta.fa" "single_fasta.fa"
> list_fasta_files()
character(0)
> list_fasta_files("/home/andrewc/files")
[1] "multi_fasta.fa" "single_fasta.fa"
Beautiful. Alright, let’s save this to a file now and then we’ll have a go at importing it an using it again.
> dir.create("functions")
> # Save the function
> dump("list_fasta_files", file = "functions/my_first_function.R")
Now to use it, you need to use the source command. The
beauty of this approach is that if you decide at our tutorial next week
that you need to list fasta files again, you can simply
source() this function again and use it straight away.
> source("/home/andrewc/functions/my_first_function.R")
> my_fasta_files <- list_fasta_files("files")
> my_fasta_files
[1] "multi_fasta.fa" "single_fasta.fa"
Another great feature of R is R Markdown, which is a file format that
allows you to create dynamic documents that combines text and executable
code into a final report. Let’s open an R Markdown doc and test it out.
Click file, New File, R Markdown.
This opens a template that already includes some text and code
blocks.
At the top there is a header bracketed by --- at the top
and bottom. By default it chooses to produce an html report but if you
prefer pdf, change it here. The line knitr::opts_chunk$set(echo = TRUE)
sets global default options for all code chunks in your R Markdown
document. echo = TRUE means that it will output the code
and the output in the final document. If you just want the output and
not the code, change this to FALSE. Following this, we have
a block of text with the header maked by ##. Under the text
block we have our first code block which gives a summary of an in-built
dataset called cars. Next we have another text block
followed by a code block that plots another in-built dataset called
pressure. If you want to see the output of this document,
click Knit at the top. The first time you do this, you’ll
have to save your file. Make sure you use the .rmd suffix
to show that its an R Markdown type file. Try switching between html and
pdf to see the difference.
R Markdown is a fantastic way to make dynamic documents that execute
underlying code and if the code or the data ever changes, you just hit
the Knit button again to make the necessary updates. For
this reason, it’s brilliant for reproducibility which
is an essential trait of good bioinformaticians. When we present
results, we need to show how we came to those results and the best way
to do that is to present the underlying code along with the results. R
Markdown combines both of those things into a single document which also
makes it excellent for producing reports for your university lecturers
to mark.